195 research outputs found

    Performance of genetic programming optimised Bowtie2 on genome comparison and analytic testing (GCAT) benchmarks.

    Genetic studies are increasingly based on short, noisy reads from next generation scanners. Typically complete DNA sequences are assembled by matching short NextGen sequences against reference genomes. Despite considerable algorithmic gains since the turn of the millennium, matching both single ended and paired end strings to a reference remains computationally demanding. Further, tailoring bioinformatics tools to each new task or scanner remains highly skilled and labour intensive. With this in mind, we recently demonstrated a genetic programming based automated technique which generated a version of the state-of-the-art alignment tool Bowtie2 that was considerably faster on short sequences produced by a scanner at the Broad Institute and released as part of The Thousand Genome Project

    Genetic programming for mining DNA chip data from cancer patients

    In machine learning terms DNA (gene) chip data is unusual in having thousands of attributes (the gene expression values) but few (<100) records (the patients). A GP based method for both feature selection and generating simple models based on a few genes is demonstrated on cancer data

    Evolving DNA motifs to predict GeneChip probe performance

    Background: Affymetrix High Density Oligonucleotide Arrays (HDONA) simultaneously measure expression of thousands of genes using millions of probes. We use correlations between measurements for the same gene across 6685 human tissue samples from NCBI's GEO database to indicate the quality of individual HG-U133A probes. Low correlation indicates a poor probe. Results: Regular expressions can be automatically created from a Backus-Naur form (BNF) context-free grammar using strongly typed genetic programming. Conclusion: The automatically produced motif is better at predicting poor DNA sequences than an existing human generated RE, suggesting runs of Cytosine and Guanine and mixtures should all be avoided. © 2009 Langdon and Harrison; licensee BioMed Central Ltd
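
    The conclusion that runs of cytosine and guanine (and mixtures of the two) should be avoided can be illustrated with a hand-written regular expression. This is only a sketch: the pattern, the minimum run length of four, and the name `has_gc_run` are assumptions made for illustration, not the motif evolved in the paper.

```python
import re

# Hypothetical pattern: four or more consecutive bases drawn from {C, G}.
# This is an illustration only, not the motif evolved in the paper.
GC_RUN = re.compile(r"[CG]{4,}")


def has_gc_run(probe: str) -> bool:
    """Return True if the probe sequence contains a run of C and/or G bases."""
    return GC_RUN.search(probe.upper()) is not None


if __name__ == "__main__":
    # Made-up 25-base sequences, used only to exercise the pattern.
    for probe in ["ATATGCATTAGCATTAGCATATTAA", "ATTAGGGGCATTACATTAGATTACA"]:
        print(probe, has_gc_run(probe))
```

    A real probe filter would need the run length and base classes tuned against measured probe correlations rather than chosen by hand as here.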

    Mycoplasma Contamination in The 1000 Genomes Project

    Background: In silico biology is increasingly important and is often based on public datasets. While the problem of contamination is well recognised in microbiology labs, the corresponding problem of database corruption has received less attention. Results: Mapping 50 billion next generation DNA sequences from The Thousand Genome Project against published genomes reveals many that match one or more Mycoplasma but are not included in the reference human genome GRCh37.p5. Many of these are of low quality but NCBI BLAST searches confirm some high quality, high entropy sequences match Mycoplasma but no human sequences. Conclusions: It appears at least 7 percent of 1000G samples are contaminated
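
    The abstract does not say how sequence entropy was measured. As one plausible reading, the sketch below computes the Shannon entropy of a read in bits per base; the function name and the choice of per-base entropy are assumptions, not the paper's stated method.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy in bits per base of a DNA string (0.0 for empty input)."""
    if not seq:
        return 0.0
    counts = Counter(seq.upper())
    n = len(seq)
    # Sum of -p * log2(p) over the observed base frequencies.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    # A homopolymer carries no information; an even base mix carries the most.
    for read in ["AAAAAAAAAAAA", "ACGTACGTACGT"]:
        print(read, shannon_entropy(read))
```

    Under this measure a low-complexity read scores near 0 and a read with all four bases equally represented scores 2 bits per base, which is one way to separate repetitive matches from informative ones.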

    Kin selection with twin genetic programming

    In steady state Twin GP both children created by sub-tree crossover and point mutation are used. They are born together and die together. Evolution is little changed. Indeed fitness selection using the twin’s co-conceived doppelganger is possible
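
    A minimal sketch of sub-tree crossover that keeps both children may help make the twin idea concrete. Everything here is assumed for illustration: trees are nested Python lists of the form `[op, child, ...]`, the helper names are hypothetical, and point mutation and the steady-state replacement policy are omitted.

```python
import copy
import random


def paths(tree, prefix=()):
    """Yield the index path to every node of a nested-list tree [op, child, ...]."""
    yield prefix
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))


def get(tree, path):
    """Return the subtree found by following path from the root."""
    for i in path:
        tree = tree[i]
    return tree


def put(tree, path, subtree):
    """Return tree with the node at path replaced by subtree."""
    if not path:
        return subtree
    get(tree, path[:-1])[path[-1]] = subtree
    return tree


def size(tree):
    """Number of nodes (functions and leaves) in the tree."""
    return sum(1 for _ in paths(tree))


def twin_crossover(mum, dad, rng):
    """Swap one random subtree between copies of the parents; return both twins."""
    a, b = copy.deepcopy(mum), copy.deepcopy(dad)
    pa = rng.choice(list(paths(a)))
    pb = rng.choice(list(paths(b)))
    sa, sb = get(a, pa), get(b, pb)
    return put(a, pa, sb), put(b, pb, sa)


if __name__ == "__main__":
    mum = ["+", "x", ["*", "x", 1]]
    dad = ["-", ["+", 2, "x"], "x"]
    twin1, twin2 = twin_crossover(mum, dad, random.Random(0))
    print(twin1, twin2)
```

    Because the two subtrees are exchanged rather than copied, the twins together contain exactly as many nodes as their parents, so keeping both wastes none of the crossover's work.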

    Failed disruption propagation in integer genetic programming

    We inject a random value into the evaluation of highly evolved deep integer GP trees 9 743 720 times and find 99.7% of test outputs are unchanged, suggesting that the impact of crossover and mutation is dissipated and seldom propagates outside the program. Indeed only errors near the root node have impact, and disruption falls exponentially with depth, at between $e^{-\text{depth}/3}$ and $e^{-\text{depth}/5}$ for recursive Fibonacci GP trees, allowing five to seven levels of nesting between the runtime perturbation and an optimal test oracle for it to detect most errors. Information theory explains that this locally flat fitness landscape is due to failed disruption propagation (FDP). Overflow is not important; instead, integer GP, like deep symbolic regression floating point GP and software in general, is not fragile, is robust, is not chaotic and suffers little from Lorenz' butterfly
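
    To get a feel for the two decay rates quoted above, the following sketch tabulates exp(-depth/3) and exp(-depth/5) for a few depths. It is plain arithmetic on the stated rates, read here as the fraction of disruptions still visible after a given number of nesting levels; the function name and that reading are assumptions, not results from the paper.

```python
import math


def propagation_fraction(depth: int, scale: float) -> float:
    """Value of exp(-depth / scale): assumed fraction of disruptions surviving `depth` levels."""
    return math.exp(-depth / scale)


if __name__ == "__main__":
    # scale=3 is the faster decay quoted in the abstract, scale=5 the slower one.
    for depth in range(0, 11):
        fast = propagation_fraction(depth, 3.0)
        slow = propagation_fraction(depth, 5.0)
        print(f"depth {depth:2d}: between {fast:.3f} and {slow:.3f}")
```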

    Dissipative Arithmetic

    Large arithmetic expressions are dissipative: they lose information and are robust to perturbations. Lack of conservation gives resilience to fluctuations. The limited precision of floating point and the mixture of linear and nonlinear operations make such functions anti-fragile and give a largely stable locally flat plateau a rich fitness landscape. This slows long-term evolution of complex programs, suggesting a need for depth-aware crossover and mutation operators in tree-based genetic programming. It also suggests that deeply nested computer program source code is error tolerant because disruptions tend to fail to propagate, and therefore the optimal placement of test oracles is as close to software defects as practical
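
    One simple way limited floating point precision loses information is absorption: adding a small value to a much larger one can leave the larger value unchanged. The sketch below shows this for IEEE 754 double precision, which Python floats use on mainstream platforms; the function name is hypothetical and the example is a generic illustration, not an experiment from the paper.

```python
def absorbed(big: float, delta: float) -> bool:
    """True if adding delta to big leaves big unchanged in double precision."""
    return big + delta == big


if __name__ == "__main__":
    # Near 1e16 adjacent doubles are 2 apart, so a perturbation of 1 is rounded away.
    print(absorbed(1e16, 1.0))
    # Near 1.0 the same perturbation is far above the rounding threshold.
    print(absorbed(1.0, 1.0))
```

    Any perturbation absorbed in this way cannot reach the expression's output, whatever is computed above it in the tree.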

    CSM-423 - Evolutionary Solo Pong Players

    An Internet Java Applet http://www.cs.essex.ac.uk/staff/poli/SoloPong/ allows users anywhere to play the Solo Pong game. We compare people's performance to a hand coded "Optimal" player and programs automatically produced by artificial intelligence. The AI techniques are: genetic programming, including a hybrid of GP and a human designed algorithm, and a particle swarm optimiser. The AI approaches are not fine tuned. GP and PSO find good players. Evolutionary computation (EC) is able to beat both human designed code and human players

    Optimising Existing Software with Genetic Programming

    We show genetic improvement of programs (GIP) can scale by evolving increased performance in a widely-used and highly complex 50000 line system. GISMOE found code that is 70 times faster (on average) and yet is at least as good functionally. Indeed it even gives a small semantic gain

    Long-term evolution experiment with genetic programming [hot off the press]

    We evolve floating point Sextic polynomial populations of genetic programming binary trees for up to a million generations. We observe continued innovation but this is limited by their depth, and suggest deep expressions are resilient to learning as they disperse information, impeding evolvability and the adaptation of highly nested organisms; instead we argue for open complexity. Programs with more than 2 000 000 000 instructions (depth 20 000) are created by crossover. To support unbounded long-term evolution experiments (LTEE) in GP we use incremental fitness evaluation and both SIMD parallel AVX 512 bit instructions and 16 threads to yield performance equivalent of up to 1.1 trillion GP operations per second, 1.1 tera-GPops, on an Intel Xeon Gold 6136 CPU 3.00GHz server